
Full Resumability PR 3: Filtering the completed tasks #1997

Draft
oyilmaz-nvidia wants to merge 16 commits into NVIDIA-NeMo:main from oyilmaz-nvidia:onur/filter-completed-tasks

@oyilmaz-nvidia
Contributor

PR Summary

Title

Skip already-completed tasks on pipeline re-run

Summary

Builds on the existing lineage checkpoint (DAG recording + BFS completion marking) to make pipelines resumable: when a stage's input task is already marked completed in the LMDB store, skip it.

Changes

  • nemo_curator/utils/lineage_store.py — adds bulk are_completed(udids) at three layers: LineageStore (single read txn, snapshot-consistent), LineageWriterActor (proxy), and a module-level helper that mirrors the no-op gating of record_lineage / mark_leaves_completed (all-False when Ray isn't initialized or no actor is registered).
  • nemo_curator/stages/base.py — new ProcessingStage._filter_completed_tasks(tasks) method, called as the first step of the default process_batch so completed tasks never reach validate_input or process. Empty _udid (source/unassigned tasks) is never filtered. Order is preserved for survivors.
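
The filtering rules described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Task`, `InMemoryStore`, and the free function `filter_completed_tasks` are hypothetical stand-ins (the real store is LMDB-backed and reached through a Ray actor), and only the behaviors stated in the bullets are modeled: one bulk lookup per batch, all-False gating when no store is registered, empty `_udid` never filtered, and order preserved.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for a pipeline task; only `_udid` matters here."""
    name: str
    _udid: str = ""


class InMemoryStore:
    """Hypothetical stand-in for the lineage store (the real one uses LMDB)."""

    def __init__(self, completed: set[str]) -> None:
        self._completed = set(completed)
        self.calls = 0  # counts bulk lookups, to show one call per batch

    def are_completed(self, udids: list[str]) -> list[bool]:
        self.calls += 1
        return [u in self._completed for u in udids]


def are_completed(udids: list[str], store: InMemoryStore | None) -> list[bool]:
    """Module-level helper: all-False when no store is registered (no-op gating)."""
    if store is None:
        return [False] * len(udids)
    return store.are_completed(udids)


def filter_completed_tasks(tasks: list[Task], store: InMemoryStore | None) -> list[Task]:
    """Drop tasks whose udid is marked completed; keep order; never drop empty udids."""
    udids = [t._udid for t in tasks if t._udid]
    if not udids:
        return list(tasks)
    flags = are_completed(udids, store)  # single bulk lookup for the whole batch
    done = {u for u, flag in zip(udids, flags) if flag}
    return [t for t in tasks if not t._udid or t._udid not in done]


if __name__ == "__main__":
    batch = [Task("src"), Task("a", "u1"), Task("b", "u2"), Task("c", "u3")]
    store = InMemoryStore({"u2"})
    print([t.name for t in filter_completed_tasks(batch, store)])  # ['src', 'a', 'c']
    print(len(filter_completed_tasks(batch, None)))  # 4
```

Passing `None` for the store models the case where no checkpoint is configured: every task survives and no lookup is made.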

Contract for stage authors

Stages that override process_batch (typically fan-in / fan-out variants) must call self._filter_completed_tasks(tasks) at the top of the override — same contract as the existing record_lineage / mark_leaves_completed helpers. The inline guidance comment in base.py documents this.
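
As an illustration of this contract, a stage that overrides `process_batch` might look like the sketch below. `BaseStageSketch` and `ChunkedFanInSketch` are hypothetical classes written for this example (tasks are plain dicts, and the filter consults an in-memory set instead of the lineage store); only the placement of the `_filter_completed_tasks` call at the top of the override reflects the contract stated above.

```python
class BaseStageSketch:
    """Hypothetical stand-in for ProcessingStage; not the real class."""

    def __init__(self, completed_udids=()):
        self._completed = set(completed_udids)

    def _filter_completed_tasks(self, tasks):
        # Stand-in: the real method asks the lineage store in one bulk call.
        return [t for t in tasks if not t["udid"] or t["udid"] not in self._completed]

    def process(self, task):
        return task

    def process_batch(self, tasks):
        # Default path: filter first, so completed tasks never reach process().
        tasks = self._filter_completed_tasks(tasks)
        return [self.process(t) for t in tasks]


class ChunkedFanInSketch(BaseStageSketch):
    """Hypothetical fan-in stage that merges every `chunk_size` inputs into one."""

    def __init__(self, chunk_size, completed_udids=()):
        super().__init__(completed_udids)
        self.chunk_size = chunk_size

    def process_batch(self, tasks):
        # Contract: an override must filter before doing anything else.
        tasks = self._filter_completed_tasks(tasks)
        if not tasks:
            return []  # all-completed batch: nothing to do
        chunks = [tasks[i:i + self.chunk_size] for i in range(0, len(tasks), self.chunk_size)]
        return [{"udid": "", "items": [t["name"] for t in chunk]} for chunk in chunks]


if __name__ == "__main__":
    stage = ChunkedFanInSketch(chunk_size=2, completed_udids={"u2"})
    batch = [
        {"name": "a", "udid": "u1"},
        {"name": "b", "udid": "u2"},
        {"name": "c", "udid": "u3"},
    ]
    print(stage.process_batch(batch))  # [{'udid': '', 'items': ['a', 'c']}]
```

Omitting the call in the override would silently reprocess completed tasks, since the base class filter only runs in the default `process_batch`.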

Tests

  • tests/utils/test_lineage_store.py — 10 new tests for bulk are_completed: empty input, all/none/mixed completed, unknown udids, empty-string udids, and module-level helper variants (no Ray, no actor, with actor).
  • tests/stages/common/test_base.py — 6 new tests in TestFilterCompletedTasks covering filter behavior and process_batch integration (completed tasks never reach process; all-completed batch returns [] cleanly).
  • tests/pipelines/test_lineage_integration.py + tests/pipelines/_resumability_runner.py — new end-to-end SIGINT/resume test. A subprocess runs a 4-stage 2000-task pipeline (fan-out → passthrough → chunked fan-in → slow writer) writing to a checkpoint path. The test sends SIGINT after 5 leaves complete, asserts partial completion in LMDB, then relaunches the runner against the same checkpoint and asserts all 4200 DAG nodes end up completed.
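
The shape of the SIGINT/resume test can be illustrated with a much smaller, self-contained sketch. Everything here is an assumption made for illustration: the child is a toy loop rather than the 4-stage pipeline, the checkpoint is an fsync'd text file rather than LMDB, and `run_once` / `demo` are hypothetical names. It assumes a POSIX system, where `Popen.send_signal(signal.SIGINT)` delivers a real interrupt.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Toy child: appends one line per finished item and skips items already recorded.
CHILD = textwrap.dedent('''
    import os, signal, sys, time
    signal.signal(signal.SIGINT, signal.default_int_handler)
    path, total = sys.argv[1], int(sys.argv[2])
    done = set()
    if os.path.exists(path):
        with open(path) as f:
            done = set(f.read().split())
    try:
        with open(path, "a") as f:
            for i in range(total):
                key = str(i)
                if key in done:
                    continue  # resume: already completed in an earlier run
                time.sleep(0.05)  # simulated work
                print(key, file=f, flush=True)
                os.fsync(f.fileno())  # make the completion record durable
                print(key, flush=True)  # progress signal for the parent
    except KeyboardInterrupt:
        sys.exit(130)
''')


def run_once(path, total, interrupt_after=None):
    """Run the child; optionally send SIGINT after N items are reported done."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, path, str(total)],
        stdout=subprocess.PIPE,
        text=True,
    )
    if interrupt_after is not None:
        seen = 0
        for _ in proc.stdout:
            seen += 1
            if seen >= interrupt_after:
                proc.send_signal(signal.SIGINT)
                break
    proc.communicate()  # drain remaining output and wait for exit
    return proc.returncode


def demo(total=20, interrupt_after=5):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.txt")
        rc1 = run_once(path, total, interrupt_after=interrupt_after)
        with open(path) as f:
            partial = f.read().split()
        rc2 = run_once(path, total)  # relaunch against the same checkpoint
        with open(path) as f:
            final = f.read().split()
    return rc1, partial, rc2, final


if __name__ == "__main__":
    rc1, partial, rc2, final = demo()
    print(rc1, len(partial), rc2, len(final))
```

The second run only processes items missing from the checkpoint file, so the final file should contain each item exactly once.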

Test plan

  • pytest tests/utils/test_lineage_store.py -k are_completed
  • pytest tests/stages/common/test_base.py::TestFilterCompletedTasks
  • pytest tests/pipelines/test_lineage_integration.py::test_resumable_after_sigint (~60s)
  • ruff check nemo_curator/ tests/
  • Existing pipeline tests still pass (no behavior change when checkpoint_path isn't supplied — filter is a silent no-op).

Design notes

  • Bulk are_completed uses a single LMDB read transaction and a single ray.get per stage batch, so per-task remote-call overhead doesn't scale with batch size.
  • The SIGINT test relies on LMDB's sync=True durability: every committed record_lineage / mark_leaves_completed call survives a hard interrupt, while in-flight uncommitted writes are lost — exactly the property the filter needs to handle on resume.

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
